.bam.bai files.Below, we will go through an example starting from raw ChIP-Seq data deposited in the SRA database and converting it to a bigWig file that can be used to visualize ChIP signals across the genome! We will recreate some of the analysis from Tsankov et. al (Nature 2015) which reported the genome-wide binding data for 38 transcription factors with extensive epigenome and transcriptional data across the differentiation of human embryonic stem cells to the three germ layers. The raw data can be accessed from the Gene Expression Omnibus (GEO) database GSE61475 or Bioproject PRJNA261253.
We will be analyzing the following files: SRR1576335, SRR1576337, SRR1576338, SRR1576340, SRR1576341 which contain Nanog binding data for human embryonic stem cells and the three germ layers.
Deeptools can be loaded on Quest using the command module load deeptools/3.5.1
To use the python package API install the deeptools package using conda or pip.
Deeptools can be used for file processing, quality control and data visualization as shown below.
Deeptools for file processing
multiBigwigSummary Given typically two or more bigWig files, multiBigwigSummary computes the average scores for each of the files in every genomic region. This analysis is performed for the entire genome by running the program in bins mode, or for certain user selected regions in BED-file mode. Most commonly, the default output of multiBigwigSummary (a compressed numpy array, .npz) is used by other tools such as plotCorrelation or plotPCA for visualization and diagnostic purposes.correctGCBias This tool corrects the GC-bias using the method proposed by Benjamini & Speed (2012). Nucleic Acids Research, 40(10). It will remove reads from regions with too high coverage compared to the expected values (typically GC-rich regions) and will add reads to regions where too few reads are seen (typically AT-rich regions). The tool computeGCBias needs to be run first to generate the frequency table needed here.bamCoverage This tool takes an alignment of reads or fragments as input (BAM file) and generates a coverage track (bigWig or bedGraph) as output. The coverage is calculated as the number of reads per bin, where bins are short consecutive counting windows of a defined size.bigwigCompare This tool compares two bigWig files based on the number of mapped reads. To compare the bigWig files, the genome is partitioned into bins of equal size, then the number of reads found in each BAM file are counted per bin and finally a summary value is reported. This value can be the ratio of the number of readsper bin, the log2 of the ratio, the sum or the difference.computeMatrix This tool calculates scores per genome regions and prepares an intermediate file that can be used with plotHeatmap and plotProfiles. Typically, the genome regions are genes, but any other regions defined in a BED file can be used. computeMatrix accepts multiple score files (bigWig format) and multiple regions files (BED format). This tool can also be used to filter and sort regions according to their score.Deeptools for Quality Control
plotCorrelation Tool for the analysis and visualization of sample correlations based on the output of multiBamSummary or multiBigwigSummary. Pearson or Spearman methods are available to compute correlation coefficients. Results can be saved as multiple scatter plots depicting the pairwise correlations or as a clustered heatmap, where the colors represent the correlation coefficients and the clusters are constructed using complete linkage. Optionally, the values can be saved as tables, too.plotPCA Tool for generating a principal component analysis (PCA) plot from multiBamSummary or multiBigwigSummary output. By default, the loadings for each sample in each principal component is plotted. If the data is transposed, the projections of each sample on the requested principal components is plotted instead.plotFingerprint This tool samples indexed BAM files and plots a profile of cumulative read coverages for each. All reads overlapping a window (bin) of the specified length are counted; these counts are sorted and the cumulative sum is finally plotted.bamPEFragmentSize This tool calculates the fragment sizes for read pairs given a BAM file from paired-end sequencing.Several regions are sampled depending on the size of the genome and number of processors to estimate thesummary statistics on the fragment lengths. Properly paired reads are preferred for computation, i.e., it will only use discordant pairs if no concordant alignments overlap with a given region. The default setting simply prints the summary statistics to the screen.computeGCBias Computes the GC-bias using Benjamini’s method [Benjamini & Speed (2012). Nucleic Acids Research, 40(10). doi: 10.1093/nar/gks001]. The GC-bias is visualized and the resulting table can be used tocorrect the bias with correctGCBias.plotCoverage This tool is useful to assess the sequencing depth of a given sample. It samples 1 million bp, counts the number of overlapping reads and can report a histogram that tells you how many bases are covered how many times. Multiple BAM files are accepted, but they all should correspond to the same genome assembly.Deeptools for data visualization
plotHeatmap This tool creates a heatmap for scores associated with genomic regions. The program requires a matrix file generated by the tool computeMatrix.plotProfile This tool creates a profile plot for scores over sets of genomic regions. Typically, these regions are genes, but any other regions defined in BED will work. A matrix generated by computeMatrix is required.plotEnrichment Tool for calculating and plotting the signal enrichment in either regions in BED format or feature types (column 3) in GTF format. The underlying datapoints can also be output. Metrics are plotted as a fraction of total reads. Regions in a BED file are assigned to the ‘peak’ feature.QC is an important part of any biological experiment. To ensure that replicate conditions across batches are more similar to each other than distinct conditions from the same batch, we will run a PCA.
bamCoverage to generate bigWig files from bam input filesmultiBigwigSummary to compute the average scores for each of the files in every genomic regionplotPCA to visualize the summary datafrom IPython.display import Image
Image("ChIP-seq/plotPCAoutput.png", embed=True, width=800)
Since we are looking at the transcription factor Nanog, we might want to map it's enrichment on TSS regions.
computeMatrix to calculate scores per genome regions and prepare an intermediate file that can be used with plotHeatmap and plotProfiles.plotProfile to plot scores over sets of genomic regions.plotHeatmap to visualize the heatmap for scores associated with genomic regionsCompute the matrix for Nanog binding to TSS in differentiated ectodermal cells whose data is contained in file dEC_Nanog.bw
from IPython.display import Image
Image("ChIP-seq/plotProfile_dEC_Nanog.png", embed=True, retina=True)
from IPython.display import Image
Image("ChIP-seq/plotHeatmap_dEC_Nanog.png", embed=True, retina=True)
Each row is the TSS of a single gene in this heatmap, and the color represents Nanog binding ChIP-Seq signal for that region.
As expected Nanog is a transcription factor, therefore its signal is highest at the TSS. However, not all genes show a signal, since Nanog will only bind to its target genes.
Compute the matrix for Nanog binding to the TSS for all the experimental runs
This will take longer since the computations are fairly intensive
From the plots below, we see that Nanog binding decreases on differentiation. This is expected since Nanog is a pluripotency factor that inhibits differentiation. Further, the levels of Nanog ChIP-Seq signal are very different in the three putative germ layers!
from IPython.display import Image
Image("ChIP-seq/plotNanogHeatmap.png", embed=True, retina=True)
We can also apply KMeans clustering on the heatmap profiles to separate gene-clusters with differential binding patterns
from IPython.display import Image
Image("ChIP-seq/plotNanogHeatmapKMeans.png", embed=True, retina=True)
You can also change the colors for each heatmap using the colorMap or colorList parameter. As well as remove thebounding boxes with boxAroundHeatmaps
from IPython.display import Image
Image("ChIP-seq/plotNanogHeatmapKMeans_colorlist.png", embed=True, retina=True)